MCIC Wooster, OSU
2024-03-27
We may colloquially refer to any consecutive series of steps in an analysis as a pipeline or workflow.
But here, I am using these words to mean something that can be executed from start to finish with a single command.
What are the advantages of creating such a pipeline, rather than running scripts one-by-one as needed?
Rerunning everything is much easier.
Re-applying the same set of analyses in a different project is much easier.
It ensures you are including all necessary steps.
The pipeline is a form of documentation of the steps taken.
It improves reproducibility, e.g. making it easier for others to repeat your analysis.
How to rerun parts of the pipeline flexibly? This may be necessary after, e.g.:
Some scripts fail for all or some samples.
Adding a sample.
Needing to modify a script or settings somewhere halfway through the pipeline.
Batch job submissions are critical in genomics but pose problems:
# Naive approach: submit all jobs for all samples at once.
for sample in "${samples[@]}"; do
    sbatch scripts/trim.sh data/"$sample".fq res/"$sample"_trim.fq
    # Problem: this mapping job may start before trimming has finished!
    sbatch scripts/map.sh res/"$sample"_trim.fq res/"$sample".bam
done
# Problem: counting needs all BAM files, but the jobs above may still be running.
# (Note that sbatch itself only prints a job ID, so redirecting its standard
# output would not capture the count table; output paths are passed as arguments.)
sbatch scripts/count.sh "${samples[@]}" res/count_table.txt
You can work around some of this with if statements, many arguments to scripts, and Slurm "job dependencies", but this is very hard to manage for more complex workflows.
Perkel 2019 - https://www.nature.com/articles/d41586-019-02619-z
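For a single step-to-step dependency, the Slurm workaround looks roughly like the sketch below. The commands are only written to a file and displayed, not submitted, since running them for real requires a Slurm cluster; the script and file names are the same made-up examples as above.

```shell
# Sketch of the Slurm "job dependency" workaround (shown, not submitted).
cat > submit_with_deps.sh << 'EOF'
# --parsable makes sbatch print only the job ID
trim_id=$(sbatch --parsable scripts/trim.sh data/sampleA.fq res/sampleA_trim.fq)
# afterok: start mapping only once the trimming job has finished successfully
sbatch --dependency=afterok:${trim_id} scripts/map.sh res/sampleA_trim.fq res/sampleA.bam
EOF
cat submit_with_deps.sh
```

Multiply this by three steps and dozens of samples, and it becomes clear why hand-rolled dependency chains are hard to maintain.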
Typically, researchers codify workflows using general scripting languages such as Python or Bash. But these often lack the necessary flexibility.
Workflows can involve hundreds to thousands of data files; a pipeline must be able to monitor their progress and exit gracefully if any step fails. And pipelines must be smart enough to work out which tasks need to be re-executed and which do not.
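At its core, the "which tasks need re-executing?" logic is a file-timestamp comparison, which Make-style tools and workflow managers automate across thousands of files. A minimal sketch of the idea in plain shell (the file names are made up for the illustration):

```shell
# A step needs re-running only if its input is newer than its existing output.
echo "raw data" > input.txt
echo "result"   > output.txt                  # output created after the input
needs_rerun() { [ "$1" -nt "$2" ]; }          # -nt: "is newer than"

if needs_rerun input.txt output.txt; then
    echo "stale: re-run the step"    > decision1.txt
else
    echo "up to date: skip the step" > decision1.txt
fi

sleep 1
touch input.txt                               # simulate editing the input file
if needs_rerun input.txt output.txt; then
    echo "stale: re-run the step"    > decision2.txt
fi
cat decision1.txt decision2.txt
```

Workflow managers layer failure handling and dependency tracking on top of exactly this kind of check.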
Pipeline/workflow tools, often called “workflow management systems”, provide ways to formally describe and execute pipelines.
Some of the most commonly used (command-line based) options in bioinformatics are Snakemake and Nextflow.
What these tools provide:
Automation
Flexibility, portability, and scalability, by separating the generic nuts-and-bolts of the pipeline from its run-specific configuration.
Most workflow tools are small "domain-specific languages" (DSLs), often a sort of extension of a more general language (Python for Snakemake, Groovy for Nextflow).
Learning one of these tools to write your own pipelines is perhaps only worth it if you plan to work on genomics/bioinformatics projects fairly regularly.
But there are also publicly available pipelines that you can use.
The “nf-core” initiative (https://nf-co.re) curates a set of best-practice, flexible, and well-documented pipelines written with Nextflow.
It also has some of its own tooling built on top of Nextflow to help contributors create more robust and user-friendly pipelines.
nf-core currently has 58 complete pipelines — these are the four most popular ones:
We will run the nf-core rnaseq pipeline (https://nf-co.re/rnaseq), a reference-based RNA-seq pipeline.
Running a pipeline like this is a bit different (and more involved) than running a typical piece of bioinformatics software — it is, after all, stitching together many operations. Some considerations:
We only need to install Nextflow and download the workflow files.
The pipeline runs all its constituent tools via Singularity containers (this is most common, and recommended) or via Conda environments.
When rerunning the pipeline with the -resume option, it will check what needs to be (re)run and what doesn't: this is quite intricate but generally works well.
The nf-core rnaseq pipeline is meant for RNA-seq projects that have a reference genome assembly and annotation available.
That might seem quite specific, but this is by far the most common RNA-seq use case.
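Concretely, launching the pipeline might look like the command below. This is a hypothetical sketch: the sample sheet, reference file names, and output dir are placeholders, and the --remove_ribo_rna option is what turns on the optional SortMeRNA step. The command is only written out for display, not executed, since that requires Nextflow, the containers, and the reference files.

```shell
# Hypothetical nf-core/rnaseq launch command (written to a file for illustration;
# all file names are placeholders).
cat > run_rnaseq.sh << 'EOF'
nextflow run nf-core/rnaseq \
    -profile singularity \
    -resume \
    --input samplesheet.csv \
    --outdir results \
    --fasta genome.fa \
    --gtf annotation.gtf \
    --remove_ribo_rna
EOF
cat run_rnaseq.sh
```

Note the single dash for Nextflow's own options (-profile, -resume) versus the double dash for pipeline parameters (--input, --fasta, etc.).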
There are two main parts to the kind of RNA-seq data analysis I just described:
The inputs are FASTQ and reference genome files (assembly & annotation), and the outputs include a gene count table and many “QC outputs”.
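To make the main output concrete: a gene count table simply has genes as rows and samples as columns, with the number of reads assigned to each gene in each cell. A toy sketch with made-up genes, samples, and numbers:

```shell
# Toy gene count table: rows = genes, columns = samples (made-up numbers).
cat > count_table_sketch.txt << 'EOF'
gene    sampleA sampleB sampleC
geneA   120     98      143
geneB   0       5       2
geneC   57      61      49
EOF
cat count_table_sketch.txt
```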
Read QC and pre-processing
Read QC (FastQC)
Adapter and quality trimming (TrimGalore)
Optional removal of rRNA (SortMeRNA) — off by default, but we will include this
Alignment & quantification
Alignment to the reference genome/transcriptome (STAR)
Gene expression quantification (Salmon)
Post-processing, QC, and reporting
Post-processing alignments: sort, index, mark duplicates (samtools, Picard)
Alignment/count QC (RSeQC, Qualimap, dupRadar, Preseq, DESeq2)
Create a QC/metrics report (MultiQC)
This pipeline is quite flexible and you can turn several steps off, add optional steps, and change individual options for most tools that the pipeline runs.
This part is “bioinformatics-heavy” with large files, a need for lots of computing power, and command-line (Unix shell) programs — it specifically involves:
Read pre-processing
Aligning reads to a reference genome
Quality control (QC) of both the reads and the alignments
Quantifying expression levels
Read pre-processing includes the following steps:
The alignment of reads to a reference genome needs to be “splice-aware”.
Alternatively, you can align to the transcriptome (i.e., all mature transcripts):
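With Salmon, for instance, the transcriptome route boils down to building an index from the transcript sequences and then quantifying each sample against it. A hypothetical sketch with placeholder file names (the commands are written out for display, not run):

```shell
# Hypothetical transcriptome-based route with Salmon (commands written to a
# file, not executed; transcripts.fa and the FASTQ names are placeholders).
cat > salmon_cmds.sh << 'EOF'
# 1. Build an index from all mature transcript sequences
salmon index -t transcripts.fa -i salmon_index
# 2. Quantify a paired-end sample (-l A = infer the library type automatically)
salmon quant -i salmon_index -l A -1 sampleA_R1.fq -2 sampleA_R2.fq -o sampleA_quant
EOF
cat salmon_cmds.sh
```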
Here are some examples of the kinds of things to pay attention to:
At heart, quantification is a simple counting exercise once you have the alignments in hand.
But made more complicated by sequencing biases and multi-mapping reads.
Current best-performing tools (e.g. Salmon) do transcript-level quantification — even though this is typically followed by gene-level aggregation prior to downstream analysis.
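The counting idea, and why multi-mapping reads complicate it, can be shown with a toy example (made-up reads and genes): read3 aligns to two genes, so a naive per-gene tally counts it twice, while discarding it entirely loses real signal.

```shell
# Toy "counting exercise": one line per read-to-gene assignment.
# read3 is a multi-mapper (it aligns to both geneA and geneB).
cat > alignments.txt << 'EOF'
read1 geneA
read2 geneA
read3 geneA
read3 geneB
read4 geneB
EOF
# Naive per-gene tally: read3 is counted once for each gene it hits.
cut -d ' ' -f 2 alignments.txt | sort | uniq -c > naive_counts.txt
cat naive_counts.txt
```

Tools like Salmon instead apportion multi-mappers probabilistically across the transcripts they are compatible with.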
Fast-moving field
Several very commonly used tools, such as featureCounts (>15k citations) and HTSeq (<18k citations), have become disfavored in the past couple of years, because they, for example, don't count multi-mapping reads at all.
The second part of RNA-seq data analysis involves analyzing the count table.
In contrast to the first part, this can be done on a laptop and instead is heavier on statistics, data visualization and biological interpretation.
It is typically done with the R language, and common steps include:
Principal Component Analysis (PCA)
Assessing overall sample clustering patterns
Differential Expression (DE) analysis
Finding genes that differ in expression level between sample groups (DEGs)
Functional enrichment analysis
See whether certain gene function groupings are over-represented among DEGs
// Toy Nextflow workflow: split a greeting into 6-character chunks,
// then uppercase each chunk in parallel.
params.greeting = 'Hello world!'
greeting_ch = Channel.of(params.greeting)

// Split the input string into files of 6 bytes each (chunk_aa, chunk_ab, ...)
process SPLITLETTERS {
    input:
    val x

    output:
    path 'chunk_*'

    script:
    """
    printf '$x' | split -b 6 - chunk_
    """
}

// Uppercase the contents of one chunk file
process CONVERTTOUPPER {
    input:
    path y

    output:
    stdout

    script:
    """
    cat $y | tr '[a-z]' '[A-Z]'
    """
}

workflow {
    letters_ch = SPLITLETTERS(greeting_ch)
    // flatten() emits each chunk file separately, so chunks run in parallel
    results_ch = CONVERTTOUPPER(letters_ch.flatten())
    results_ch.view { it }
}
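To try the toy workflow above, you would save it as, say, hello.nf and launch it with Nextflow (the commands are only written out here, since executing them requires a Nextflow installation). Any params value can be overridden on the command line with a double-dash flag:

```shell
# How the workflow above would be launched (commands shown, not executed).
cat > run_hello.txt << 'EOF'
nextflow run hello.nf
# override params.greeting from the command line:
nextflow run hello.nf --greeting 'Hola mundo!'
EOF
cat run_hello.txt
```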